Importing libraries

In [2]:
import os
import pickle
import urllib
import warnings
from datetime import timedelta, date, datetime
from math import isnan
from random import sample

import numpy as np
from numpy.random import uniform
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
from matplotlib.pyplot import plot
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
from scipy import stats
from tqdm import tqdm

from sqlalchemy import create_engine
import psycopg2

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import sklearn.metrics as metrics
from sklearn.model_selection import (train_test_split, StratifiedKFold,
                                     GridSearchCV, cross_validate, cross_val_score)
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import LabelEncoder, PowerTransformer

warnings.filterwarnings("ignore")
%matplotlib inline
/Users/camilohernandez/opt/anaconda3/lib/python3.7/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
  """)

Importing the dataset and loading it into a PostgreSQL database created using pgAdmin 4

The data covers only March 2020

In [3]:
cases = pd.read_csv('Casos1.csv')
  • First review of the dataset
In [4]:
cases.head()
Out[4]:
ID de caso Fecha de diagnóstico Ciudad de ubicación Departamento o Distrito Atención** Edad Sexo Tipo* País de procedencia
0 1 06/03/2020 Bogotá Bogotá D.C. Recuperado 19 F Importado Italia
1 2 09/03/2020 Buga Valle del Cauca Recuperado 34 M Importado España
2 3 09/03/2020 Medellín Antioquia Recuperado 50 F Importado España
3 4 11/03/2020 Medellín Antioquia Recuperado 55 M Relacionado Colombia
4 5 11/03/2020 Medellín Antioquia Recuperado 25 M Relacionado Colombia
  • Loading the table into the database 'PlatziDB' created just for this purpose.
In [ ]:
engine = create_engine('postgresql://postgres:olimac@localhost:5432/PlatziDB')
cases.to_sql('cases', engine)
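Note: re-running `to_sql` as written raises a `ValueError` once the table already exists. A minimal sketch of the safer call, shown against an in-memory SQLite connection so it runs standalone (the PostgreSQL engine would be passed the same way):

```python
import sqlite3
import pandas as pd

# Toy frame standing in for the real `cases` DataFrame
cases = pd.DataFrame({"ID": [1, 2], "city": ["Bogota", "Buga"]})

conn = sqlite3.connect(":memory:")
cases.to_sql("cases", conn, index=False)                       # first load
cases.to_sql("cases", conn, if_exists="replace", index=False)  # safe re-run

print(pd.read_sql("SELECT COUNT(*) AS n FROM cases", conn)["n"][0])  # prints 2
```

`if_exists='append'` is the alternative when new rows should accumulate instead of overwriting the table.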
  • Getting the table from the PostgreSQL PlatziDB database. The table is saved in a dataframe and some column labels are adjusted.
In [5]:
conn = psycopg2.connect('dbname=PlatziDB user=postgres host=localhost password=olimac')
cur = conn.cursor()
cur.execute('''SELECT * FROM cases;''')
DB_cases = pd.DataFrame(list(cur.fetchall()))

# The cursor does not carry column names, so they are assigned manually
DB_cases.columns = ["ID_sql", "ID", "date", "city", "departamento", "state", "age", "sex", "type", "procedence"]

# Strip accents from all string columns (e.g. 'Bogotá' -> 'Bogota')
cols_ = DB_cases.select_dtypes(include=[object]).columns
DB_cases[cols_] = DB_cases[cols_].apply(lambda x: x.str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8'))


DB_cases.head()
Out[5]:
ID_sql ID date city departamento state age sex type procedence
0 0 1 2020-03-06 Bogota Bogota D.C. Recuperado 19 F Importado Italia
1 1 2 2020-03-09 Buga Valle del Cauca Recuperado 34 M Importado Espana
2 2 3 2020-03-09 Medellin Antioquia Recuperado 50 F Importado Espana
3 3 4 2020-03-11 Medellin Antioquia Recuperado 55 M Relacionado Colombia
4 4 5 2020-03-11 Medellin Antioquia Recuperado 25 M Relacionado Colombia
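As an aside, fetching through a raw cursor discards the column names, which is why they have to be reassigned by hand. `pd.read_sql` keeps them; a sketch against SQLite so it runs standalone (the psycopg2 connection could be passed the same way):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
pd.DataFrame({"ID": [1], "city": ["Bogota"]}).to_sql("cases", conn, index=False)

# read_sql returns a DataFrame carrying the table's own column names
DB_cases = pd.read_sql("SELECT * FROM cases", conn)
print(list(DB_cases.columns))  # ['ID', 'city']
```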
  • In this dataset there are 8 relevant columns to analyse:
    • Date: only March 2020 is considered.
    • City: city in Colombia.
    • Departamento: department of Colombia.
    • State: current state of the infected person.
    • Age: age of the infected person.
    • Sex: sex of the infected person.
    • Type: kind of case (e.g. whether the person arrived from another country).
    • Procedence: country of origin, if the case was imported.

Understanding the database

In [7]:
print("Number of cases: " + str(len(DB_cases)))
print("The timeframe considered is between " + str(min(DB_cases['date'])) + " and " + str(max(DB_cases['date'])))
Number of cases: 1161
The timeframe considered is between 2020-03-06 00:00:00 and 2020-04-02 00:00:00
  • Are there any null values in the dataset?
In [8]:
DB_cases.isnull().sum()
Out[8]:
ID_sql          0
ID              0
date            0
city            0
departamento    0
state           0
age             0
sex             0
type            0
procedence      0
dtype: int64
  • As there are no null values in the data, I will start with the analysis.
  • In the following part, I want to understand the distribution of cases per department in order to prioritize, by percentage, the most important ones to focus on.
- The top three departments are Bogota D.C., Valle del Cauca and Antioquia, with around 70% of all cases in March. The following chart displays the number of cases per department, sorted.
In [9]:
order_percent = DB_cases.departamento.value_counts(normalize = True)
print(order_percent)

plt.figure(figsize=(15,8))
sns.countplot(y="departamento", data=DB_cases, palette="Set3",order=order_percent.index.tolist())
Bogota D.C.              0.466839
Valle del Cauca          0.129199
Antioquia                0.109388
Cundinamarca             0.036176
Cartagena D.T. y C       0.033592
Risaralda                0.030146
Barranquilla D.E.        0.026701
Huila                    0.025840
Quindio                  0.019811
Norte de Santander       0.018088
Cesar                    0.013781
Caldas                   0.013781
Tolima                   0.012059
Santander                0.010336
Cauca                    0.010336
Santa Marta D.T. y C.    0.010336
Meta                     0.010336
Narino                   0.005168
Boyaca                   0.005168
Atlantico                0.003445
Bolivar                  0.002584
San Andres               0.001723
Casanare                 0.001723
Cordoba                  0.001723
Sucre                    0.000861
La Guajira               0.000861
Name: departamento, dtype: float64
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1f7d5890>

Is there any tendency between age and state?

- Yes: people in the 'Hospital', 'Hospital UCI' and 'Fallecido' categories tend to have a higher median age than people in the 'Recuperado' or 'Casa' categories.
In [10]:
plt.figure(figsize=(15,8))
ax = sns.boxplot(x="state", y="age", data=DB_cases,palette="Set3")
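The boxplot reading can be cross-checked numerically with per-state median ages. A sketch on toy data; on the real frame the call is simply `DB_cases.groupby('state')['age'].median()`:

```python
import pandas as pd

# Toy stand-in for DB_cases; the groupby call is identical on the real data
df = pd.DataFrame({
    "state": ["Recuperado", "Recuperado", "Hospital", "Hospital", "Fallecido"],
    "age":   [25, 35, 60, 70, 80],
})
print(df.groupby("state")["age"].median().sort_values())
```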

Is there any tendency between age and the type of register (Importado, Relacionado or En estudio)?

- No, there does not seem to be a tendency between age and the type of register.
In [11]:
plt.figure(figsize=(15,8))
ax = sns.boxplot(x="type", y="age", data=DB_cases,palette="Set3")

Is there any tendency between state and sex?

- The proportion of F to M in the 'Recuperado' and 'Casa' categories is about the same, with slightly more F than M. On the other hand, in the 'Hospital', 'Hospital UCI' and 'Fallecido' categories the proportion is different: there are more M than F.
In [12]:
plt.figure(figsize=(15,8))
ax = sns.countplot(x="state", hue="sex", data=DB_cases,palette="Set3")
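The sex/state proportions can be quantified with a normalized crosstab rather than read off the countplot. A sketch on toy data; on the real frame: `pd.crosstab(DB_cases['state'], DB_cases['sex'], normalize='index')`:

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["Casa", "Casa", "Casa", "Hospital", "Hospital"],
    "sex":   ["F", "F", "M", "M", "M"],
})
# normalize='index' makes each row sum to 1: share of F vs M within each state
print(pd.crosstab(df["state"], df["sex"], normalize="index"))
```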

Is there any tendency between state, age and sex?

- For 'Recuperado', 'Casa', 'Hospital UCI' and 'Fallecido' there is no clear tendency between age and sex. For 'Hospital', however, M tend to be younger than F.
In [13]:
plt.figure(figsize=(15,8))
ax = sns.boxplot(x="state", y="age", hue="sex",data=DB_cases, palette="Set3")

How does each state behave over time?

- The number of people at home is clearly increasing over the month, while the number of people in hospital seems to decrease toward the end of the month.
In [14]:
# Set the figure size
fig, ax = plt.subplots(figsize=(15,7))

# Count cases per day for each state and plot the result
DB_cases.groupby([pd.DatetimeIndex(DB_cases.date).to_period("D"),'state']).count()['ID'].unstack().plot(ax=ax)
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a20cb4cd0>

What is the cumulative behavior of people at home?

- During March the tendency appears to be exponential.
In [15]:
df = DB_cases.copy()

# Add a column of ones and accumulate it per state to get a running case count
df = df.assign(ones=1)
df = df.sort_values(['date']).reset_index(drop=True)
df["cum_ones"] = df.groupby(['state'])['ones'].cumsum(axis=0)
In [16]:
depar = ['Casa']

df_casa = df[ df['state'].isin(depar) ]


fig, ax = plt.subplots(figsize=(15,7))

df_casa.groupby([pd.DatetimeIndex(df_casa.date).to_period("D"),'state']).max()['cum_ones'].unstack().plot(ax=ax)
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a20f75c90>
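The "exponential" reading can be checked rather than asserted: if the cumulative counts are exponential, their logarithm is roughly linear in time. A sketch with synthetic counts (the growth rate here is assumed; on the real data the series would be the daily maximum of `cum_ones`):

```python
import numpy as np

# Synthetic cumulative series growing like 5 * exp(0.2 * day)
days = np.arange(1, 28)
cum = 5 * np.exp(0.2 * days)

# Linear fit on the log scale: the slope estimates the daily growth rate
slope, intercept = np.polyfit(days, np.log(cum), 1)
print(round(slope, 2))
```

A visibly linear log-scale plot (or a fit with small residuals) supports the exponential claim; curvature would argue against it.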

What is the cumulative behavior of the other categories?

  • The number of people in hospital seems to grow exponentially. 'Hospital UCI' also started exponentially but decreased slightly at the end of the month. The 'Fallecido' category seems to grow at a constant rate.
In [17]:
depar = ['Casa']

df_casa = df[~df['state'].isin(depar) ]


fig, ax = plt.subplots(figsize=(15,7))

# Take the daily maximum of the cumulative count per state and plot it
df_casa.groupby([pd.DatetimeIndex(df_casa.date).to_period("D"),'state']).max()['cum_ones'].unstack().plot(ax=ax)
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a20f87f50>

What is the distribution of the types of register?

  • 'Importado' cases are the most numerous, followed by 'Relacionado', and finally 'En estudio' with around 260 cases.
In [18]:
DB_cases.columns
Out[18]:
Index(['ID_sql', 'ID', 'date', 'city', 'departamento', 'state', 'age', 'sex',
       'type', 'procedence'],
      dtype='object')
In [19]:
df_proce = DB_cases.copy()

order_percent = df_proce.type.value_counts(normalize = True)
print(order_percent)
plt.figure(figsize=(15,8))
sns.countplot(y="type", data=df_proce, palette="Set3",order=order_percent.index.tolist())
Importado      0.451335
Relacionado    0.326443
En estudio     0.222222
Name: type, dtype: float64
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a207a2210>

What is the distribution of type of register and sex?

  • The 'Importado' category tends to have more M than F. Conversely, the 'Relacionado' category has more F than M, and 'En estudio' has almost the same number of both.
In [20]:
plt.figure(figsize=(15,8))
ax = sns.countplot(x="type", hue="sex", data=DB_cases,palette="Set3")

What is the distribution of type of register and state?

  • In the 'En estudio' type there seems to be no recovered person. The other types seem to share the same distribution by state.
In [21]:
plt.figure(figsize=(15,8))
ax = sns.countplot(x="type", hue="state", data=DB_cases,palette="Set3")

What is the distribution of imported cases by country of procedence?

  • Spain, the United States, Ecuador, Mexico, Turkey and Panama cover around 77% of all imported cases.
In [22]:
depar = ['Importado']

df_proce_importado = df_proce[ df_proce['type'].isin(depar) ]

order_percent = df_proce_importado.procedence.value_counts(normalize = True)

print(order_percent)

plt.figure(figsize=(20,20))
sns.countplot(y="procedence", data=df_proce_importado, palette="Set3",order=order_percent.index.tolist())
Espana                                             0.360687
Estados Unidos                                     0.236641
Ecuador                                            0.059160
Mexico                                             0.049618
Turquia                                            0.032443
Panama                                             0.030534
Francia                                            0.024809
Brasil                                             0.020992
Puerto Rico                                        0.017176
Londres                                            0.015267
Italia                                             0.011450
Islas San Martin                                   0.011450
Jamaica                                            0.009542
Alemania                                           0.007634
Canada                                             0.005725
Republica Dominicana                               0.005725
Inglaterra                                         0.005725
Espana - Italia                                    0.005725
Peru                                               0.003817
Grecia                                             0.003817
Argentina                                          0.003817
Egipto                                             0.003817
Aruba                                              0.003817
Italia - Espana - Turquia                          0.003817
Espana - Francia                                   0.003817
Costa Rica                                         0.003817
Turquia - Grecia                                   0.003817
Chile                                              0.003817
Francia - Holanda                                  0.001908
Belgica                                            0.001908
Jamaica - Panama - Islas del caribe - Cartagena    0.001908
Italia - Jamaica - Panama                          0.001908
Emiratos Arabes                                    0.001908
Espana - Turquia                                   0.001908
Suiza                                              0.001908
Panama - Jamaica                                   0.001908
Israel Egipto                                      0.001908
0                                                  0.001908
Marruecos                                          0.001908
Cuba                                               0.001908
Jamaica - Isla Caiman - Panama                     0.001908
Europa                                             0.001908
Croacia                                            0.001908
Italia - Ucrania - Espana                          0.001908
Venezuela                                          0.001908
Espana - Croacia - Bosnia                          0.001908
Isla Martin - Caribe                               0.001908
Espana - Egipto                                    0.001908
Madrid                                             0.001908
Espana - Croacia                                   0.001908
Guatemala                                          0.001908
Espana - India                                     0.001908
Jamaica - Panama - Isla Caiman                     0.001908
Alemania - Estambul                                0.001908
Arabia                                             0.001908
Name: procedence, dtype: float64
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a204bdd50>
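The coverage figure quoted above can be verified directly from the normalized counts instead of summing by hand. A sketch on a toy series; on the real data the check is `order_percent.head(6).sum()`:

```python
import pandas as pd

# Toy stand-in for the procedence column
s = pd.Series(["Espana"] * 4 + ["Estados Unidos"] * 3 + ["Ecuador"] * 2 + ["Peru"])

shares = s.value_counts(normalize=True)
print(round(shares.head(3).sum(), 2))  # share covered by the top 3; prints 0.9
```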

What is the distribution of 'En estudio' cases by 'Departamento'?

  • Bogota, Valle del Cauca, Antioquia, Santa Marta, Cundinamarca and Huila cover around 92% of all 'En estudio' cases.
In [23]:
depar = ['En estudio']

df_proce_importado = df_proce[ df_proce['type'].isin(depar) ]

order_percent = df_proce_importado.departamento.value_counts(normalize = True)

print(order_percent)

plt.figure(figsize=(20,20))
sns.countplot(y="departamento", data=df_proce_importado, palette="Set3",order=order_percent.index.tolist())
Bogota D.C.              0.701550
Valle del Cauca          0.093023
Antioquia                0.069767
Santa Marta D.T. y C.    0.023256
Cundinamarca             0.019380
Huila                    0.015504
Barranquilla D.E.        0.015504
Atlantico                0.011628
Norte de Santander       0.011628
Boyaca                   0.007752
Cartagena D.T. y C       0.007752
San Andres               0.003876
Cauca                    0.003876
Santander                0.003876
Risaralda                0.003876
Quindio                  0.003876
Casanare                 0.003876
Name: departamento, dtype: float64
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a20bd2bd0>

What is the distribution of 'Relacionado' cases by 'Departamento'?

  • Bogota, Valle del Cauca, Antioquia, Cartagena, Huila and Risaralda cover around 77% of all 'Relacionado' cases.
In [24]:
depar = ['Relacionado']

df_proce_importado = df_proce[ df_proce['type'].isin(depar) ]

order_percent = df_proce_importado.departamento.value_counts(normalize = True)

print(order_percent)

plt.figure(figsize=(20,20))
sns.countplot(y="departamento", data=df_proce_importado, palette="Set3",order=order_percent.index.tolist())
Bogota D.C.              0.377309
Valle del Cauca          0.145119
Antioquia                0.087071
Cartagena D.T. y C       0.068602
Huila                    0.055409
Risaralda                0.034301
Cundinamarca             0.031662
Cesar                    0.031662
Quindio                  0.029024
Barranquilla D.E.        0.023747
Norte de Santander       0.023747
Cauca                    0.023747
Caldas                   0.018470
Santa Marta D.T. y C.    0.013193
Tolima                   0.010554
Bolivar                  0.005277
Boyaca                   0.005277
Cordoba                  0.002639
Sucre                    0.002639
Casanare                 0.002639
Atlantico                0.002639
San Andres               0.002639
Santander                0.002639
Name: departamento, dtype: float64
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1fbdc150>

What is the distribution of cities within the top three 'Departamentos'?

  • Bogota, Cali, Medellin, Palmira, Envigado and Rionegro cover around 94% of the cases in these three departments.
In [25]:
DB_cases_copy = DB_cases.copy()
depar = ['Bogota D.C.','Valle del Cauca','Antioquia']

DB_cases_copy = DB_cases_copy[ DB_cases_copy['departamento'].isin(depar) ]

order_percent_city = DB_cases_copy.city.value_counts(normalize = True)

print(order_percent_city)


plt.figure(figsize=(15,8))
sns.countplot(y="city", data=DB_cases_copy, palette="Set3", order = order_percent_city.index.tolist())
Bogota                0.661783
Cali                  0.126984
Medellin              0.105006
Palmira               0.028083
Envigado              0.010989
Bello                 0.009768
Rionegro              0.009768
Cartago               0.007326
Yumbo                 0.006105
Itagui                0.004884
Tulua                 0.003663
Sabaneta              0.002442
Santa Rosa de Osos    0.002442
cali                  0.001221
Darien                0.001221
CALI                  0.001221
Ulloa                 0.001221
Montebello            0.001221
Frontino              0.001221
El Dovio              0.001221
El Retiro             0.001221
San Pedro             0.001221
La Union              0.001221
Apartado              0.001221
Buga                  0.001221
Trujillo              0.001221
Barbosa               0.001221
Guatape               0.001221
Guarne                0.001221
Don Matias            0.001221
Name: city, dtype: float64
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a205341d0>

What is the distribution of 'Departamento', 'state' and 'age'?

  • According to the following chart, the higher the age, the worse the state tends to be ('Hospital', 'Hospital UCI' or 'Fallecido').
In [26]:
plt.figure(figsize=(15,8))
ax = sns.boxplot(x="departamento", y="age", hue="state",data=DB_cases_copy, palette="Set3")

Segmentation model

- This is an unsupervised model that identifies groups of registers with similar features. After identifying these groups, several strategies can be created and adjusted for the selected group. The most common method is K-means, which works really well for numerical values but not for mixed data. That is why I used a data transformation that takes both numerical and categorical variables into account: Factor Analysis of Mixed Data (FAMD).
In [27]:
DB_cases.columns
Out[27]:
Index(['ID_sql', 'ID', 'date', 'city', 'departamento', 'state', 'age', 'sex',
       'type', 'procedence'],
      dtype='object')
  • Data_k is created in order to save the data that will be used to run the model.
In [28]:
Data_k = DB_cases[['city','departamento','state','age','sex','type','procedence']]

FAMD method

  • As there are numerical variables such as 'age' and categorical variables such as 'state', the best method to handle this mix is Factor Analysis of Mixed Data, which outputs numerical components that combine the numerical variables with the effect of the categorical ones.

In this case I extracted two components and plotted them.

In [29]:
import prince

famd = prince.FAMD(
     n_components=2,
     n_iter=3,
     copy=True,
     check_input=True,
     engine='auto',
     random_state=42
 )
famd = famd.fit(Data_k)

l = famd.row_coordinates(Data_k)
print(l[0])
sns.relplot(x=l[0], y=l[1])
0      -1.167073
1      -0.663234
2       0.288694
3       0.605636
4      -1.040370
          ...   
1156   -0.815282
1157   -0.125676
1158   -2.053204
1159    0.415805
1160   -1.153327
Name: 0, Length: 1161, dtype: float64
Out[29]:
<seaborn.axisgrid.FacetGrid at 0x1a202a3f50>
  • FAMD components converted into a DataFrame
In [30]:
# Rename the FAMD row coordinates into a components DataFrame
principalDf = l.rename(columns={0: 'FAMD_component_1', 1: 'FAMD_component_2'})
print(principalDf)
      FAMD_component_1  FAMD_component_2
0            -1.167073         -0.454289
1            -0.663234         -1.118854
2             0.288694         -1.519891
3             0.605636         -0.312398
4            -1.040370         -0.120406
...                ...               ...
1156         -0.815282         -0.460782
1157         -0.125676          0.792487
1158         -2.053204          1.000271
1159          0.415805          0.712282
1160         -1.153327         -0.421047

[1161 rows x 2 columns]

Implementation of methods to determine the optimum number of clusters: the elbow method and the silhouette method

  • According to the elbow method, the optimal number of clusters seems to be 4 or 5.
In [31]:
kmeans_kwargs = {
        "init": "random",
        "n_init": 10,
        "max_iter": 300,
        "random_state": 42,
    }
sse = {}
for k in tqdm(range(2, 30)):
    kmeans = KMeans(n_clusters=k, **kmeans_kwargs).fit(principalDf)
    sse[k] = kmeans.inertia_ # Inertia: Sum of distances of samples to their closest cluster center
    
fig = go.Figure(data=go.Scatter(x=list(sse.keys()), y=list(sse.values())))
fig.show()
100%|██████████| 28/28 [00:05<00:00,  4.83it/s]
  • According to the silhouette method, the optimal number of clusters seems to be 4.
In [32]:
from sklearn.metrics import silhouette_score
# A list holds the silhouette coefficients for each k
silhouette_coefficients = []

# Notice you start at 2 clusters for silhouette coefficient
for k in range(2, 30):
    kmeans = KMeans(n_clusters=k, **kmeans_kwargs)
    kmeans.fit(principalDf)
    score = silhouette_score(principalDf, kmeans.labels_)
    silhouette_coefficients.append(score)

plt.style.use("fivethirtyeight")
plt.plot(range(2, 30), silhouette_coefficients)
plt.xticks(range(2, 30))
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Coefficient")
plt.show()
  • Running k-means on the two components obtained with the FAMD method, with 4 clusters as a parameter.
In [18]:
data = principalDf
# Apply a power transform to each component to make it more Gaussian-like
for c in data.columns:
    pt = PowerTransformer()
    data.loc[:, c] = pt.fit_transform(np.array(data[c]).reshape(-1, 1))

kmeans = KMeans(n_clusters=4,max_iter= 1000000).fit(data)
kmeans_labels = kmeans.labels_

pd.Series(kmeans_labels).value_counts()
Out[18]:
2    334
0    285
1    275
3    267
dtype: int64
  • I assigned the clustering result to the dataset with the components and plotted it to check whether there is a clear separation among the clusters.
In [19]:
lgbm_data = principalDf.copy()
lgbm_data['Clusters_k_means'] = kmeans_labels
print(lgbm_data)
print(lgbm_data.columns)
sns.relplot(x='FAMD_component_1', y='FAMD_component_2', hue='Clusters_k_means', style='Clusters_k_means',data=lgbm_data, palette="Paired")
      FAMD_component_1  FAMD_component_2  Clusters_k_means
0            -1.166365         -0.578704                 0
1            -0.614439         -1.583330                 0
2             0.370681         -1.418086                 3
3             0.668208         -0.301070                 3
4            -0.996497         -0.088650                 0
...                ...               ...               ...
1156         -0.777636         -0.588427                 0
1157         -0.007200          1.000050                 2
1158         -2.285676          1.222819                 2
1159          0.496753          0.826223                 1
1160         -1.174910         -0.493215                 0

[1161 rows x 3 columns]
Index(['FAMD_component_1', 'FAMD_component_2', 'Clusters_k_means'], dtype='object')
Out[19]:
<seaborn.axisgrid.FacetGrid at 0x11cc84910>
  • Finally, the clusters are assigned to the original data in order to understand the effect of each variable.
In [22]:
fin_lgbm_data = DB_cases.copy()
fin_lgbm_data['Clusters_k_means'] = kmeans_labels
In [23]:
fin_lgbm_data.head()
Out[23]:
ID_sql ID date city departamento state age sex type procedence Clusters_k_means
0 0 1 2020-03-06 Bogota Bogota D.C. Recuperado 19 F Importado Italia 0
1 1 2 2020-03-09 Buga Valle del Cauca Recuperado 34 M Importado Espana 0
2 2 3 2020-03-09 Medellin Antioquia Recuperado 50 F Importado Espana 3
3 3 4 2020-03-11 Medellin Antioquia Recuperado 55 M Relacionado Colombia 3
4 4 5 2020-03-11 Medellin Antioquia Recuperado 25 M Relacionado Colombia 0

Effect of numerical variables

- All clusters can clearly be identified by age range: clusters 1 and 3 have a higher age range than clusters 0 and 2.
In [24]:
numerical_sig = ['age']
for i in numerical_sig:
    plt.figure(figsize=(15,8))
    ax = sns.boxplot(x="Clusters_k_means", y=i, data=fin_lgbm_data)
    ax2 = sns.catplot(x="Clusters_k_means", y=i, data=fin_lgbm_data)

Effect of categorical variables

- Although clusters 1 and 3 are very similar in age range, they can be distinguished because cluster 3 contains only 'Importado' cases while cluster 1 contains 'Relacionado' and 'En estudio' cases.
- Likewise, clusters 0 and 2 are very similar in age range, but cluster 2 contains 'Relacionado' or 'En estudio' cases while cluster 0 contains mostly 'Importado' ones. In addition, the distribution of F and M changes between clusters 0 and 2.
In [27]:
categorical_sig = ['state','sex','type']
for i in categorical_sig:
    plt.figure(figsize=(10,4)) #this creates a new figure on which your plot will appear
    sns.countplot(y="Clusters_k_means", hue=i, data=fin_lgbm_data)